Project 1.1: A RISC-V Assembler (Individual Project)

Computer Architecture I ShanghaiTech University

Project 1.1

IMPORTANT INFO - PLEASE READ

The projects are part of your design project, which is worth 2 credit points. As such, they run in parallel with the course itself, so project and homework due dates may fall very close together. Start early and do not procrastinate.


Introduction to Project 1.1

In Project 1.1, you will write a simple one-pass RISC-V assembler. The assembler takes RISC-V assembly code containing no labels or symbols as input and outputs the corresponding machine code. You also need to implement basic error handling to detect invalid instructions. You can fetch the framework for Project 1.1 here on GitHub Classroom; try to use git and GitHub for version control.

Background of The Instruction Set

Registers

Please consult the RISC-V Green Sheet (PDF) for register numbers, instruction opcodes, and bitwise formats. Our assembler will support all 32 registers: zero, ra, sp, gp, tp, t0 - t6, s0 - s11, a0 - a7. The numeric register names (e.g. x0, x1, x2) shall also be supported. Note that floating-point registers are not included in this project.
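
The name-to-number mapping described above can be sketched as a small lookup. This is a minimal sketch, not part of the framework: the helper name reg_number and its return convention (-1 for a bad register) are assumptions made for illustration, and the table follows the standard RISC-V ABI numbering.

```c
#include <ctype.h>
#include <stdlib.h>
#include <string.h>

/* ABI names indexed by register number (x0..x31). */
static const char *const ABI_NAMES[32] = {
    "zero", "ra", "sp", "gp", "tp", "t0", "t1", "t2",
    "s0", "s1", "a0", "a1", "a2", "a3", "a4", "a5",
    "a6", "a7", "s2", "s3", "s4", "s5", "s6", "s7",
    "s8", "s9", "s10", "s11", "t3", "t4", "t5", "t6"
};

/* Hypothetical helper: returns 0..31 for a valid register name, -1 otherwise. */
int reg_number(const char *name)
{
    int i;

    if (name == NULL) {
        return -1;
    }
    /* First try the ABI names (zero, ra, sp, ...). */
    for (i = 0; i < 32; i++) {
        if (strcmp(name, ABI_NAMES[i]) == 0) {
            return i;
        }
    }
    /* Then try the numeric form: 'x' followed by digits, value 0..31. */
    if (name[0] == 'x' && isdigit((unsigned char)name[1])) {
        char *end;
        long n = strtol(name + 1, &end, 10);
        if (*end == '\0' && n >= 0 && n <= 31) {
            return (int)n;
        }
    }
    return -1;
}
```

With this sketch, reg_number("a0") gives 10 and reg_number("x32") gives -1, which the caller can turn into a bad-register error.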

Instructions

We will have 42 instructions and 6 pseudo-instructions to assemble. The instructions are:

    Instruction            Type  Opcode  Funct3  Funct7/IMM  Operation
    add rd, rs1, rs2       R     0x33    0x0     0x00        R[rd] ← R[rs1] + R[rs2]
    mul rd, rs1, rs2       R     0x33    0x0     0x01        R[rd] ← (R[rs1] * R[rs2])[31:0]
    sub rd, rs1, rs2       R     0x33    0x0     0x20        R[rd] ← R[rs1] - R[rs2]
    sll rd, rs1, rs2       R     0x33    0x1     0x00        R[rd] ← R[rs1] << R[rs2]
    mulh rd, rs1, rs2      R     0x33    0x1     0x01        R[rd] ← (R[rs1] * R[rs2])[63:32]
    slt rd, rs1, rs2       R     0x33    0x2     0x00        R[rd] ← (R[rs1] < R[rs2]) ? 1 : 0
    sltu rd, rs1, rs2      R     0x33    0x3     0x00        R[rd] ← (U(R[rs1]) < U(R[rs2])) ? 1 : 0
    xor rd, rs1, rs2       R     0x33    0x4     0x00        R[rd] ← R[rs1] ^ R[rs2]
    div rd, rs1, rs2       R     0x33    0x4     0x01        R[rd] ← R[rs1] / R[rs2]
    srl rd, rs1, rs2       R     0x33    0x5     0x00        R[rd] ← R[rs1] >> R[rs2] (logical)
    sra rd, rs1, rs2       R     0x33    0x5     0x20        R[rd] ← R[rs1] >> R[rs2] (arithmetic)
    or rd, rs1, rs2        R     0x33    0x6     0x00        R[rd] ← R[rs1] | R[rs2]
    rem rd, rs1, rs2       R     0x33    0x6     0x01        R[rd] ← R[rs1] % R[rs2]
    and rd, rs1, rs2       R     0x33    0x7     0x00        R[rd] ← R[rs1] & R[rs2]
    lb rd, offset(rs1)     I     0x03    0x0     -           R[rd] ← SignExt(Mem(R[rs1] + offset, byte))
    lh rd, offset(rs1)     I     0x03    0x1     -           R[rd] ← SignExt(Mem(R[rs1] + offset, half))
    lw rd, offset(rs1)     I     0x03    0x2     -           R[rd] ← Mem(R[rs1] + offset, word)
    lbu rd, offset(rs1)    I     0x03    0x4     -           R[rd] ← U(Mem(R[rs1] + offset, byte))
    lhu rd, offset(rs1)    I     0x03    0x5     -           R[rd] ← U(Mem(R[rs1] + offset, half))
    addi rd, rs1, imm      I     0x13    0x0     -           R[rd] ← R[rs1] + imm
    slli rd, rs1, imm      I     0x13    0x1     0x00        R[rd] ← R[rs1] << imm
    slti rd, rs1, imm      I     0x13    0x2     -           R[rd] ← (R[rs1] < imm) ? 1 : 0
    sltiu rd, rs1, imm     I     0x13    0x3     -           R[rd] ← (U(R[rs1]) < U(imm)) ? 1 : 0
    xori rd, rs1, imm      I     0x13    0x4     -           R[rd] ← R[rs1] ^ imm
    srli rd, rs1, imm      I     0x13    0x5     0x00        R[rd] ← R[rs1] >> imm (logical)
    srai rd, rs1, imm      I     0x13    0x5     0x20        R[rd] ← R[rs1] >> imm (arithmetic)
    ori rd, rs1, imm       I     0x13    0x6     -           R[rd] ← R[rs1] | imm
    andi rd, rs1, imm      I     0x13    0x7     -           R[rd] ← R[rs1] & imm
    jalr rd, rs1, imm      I     0x67    0x0     -           R[rd] ← PC + 4; PC ← R[rs1] + imm
    ecall                  I     0x73    0x0     0x000       (Transfers control to operating system)
                                                             a0 = 1 is print value of a1 as an integer.
                                                             a0 = 4 is print the string at address a1.
                                                             a0 = 10 is exit or end of code indicator.
                                                             a0 = 11 is print value of a1 as a character.
    sb rs2, offset(rs1)    S     0x23    0x0     -           Mem(R[rs1] + offset) ← R[rs2][7:0]
    sh rs2, offset(rs1)    S     0x23    0x1     -           Mem(R[rs1] + offset) ← R[rs2][15:0]
    sw rs2, offset(rs1)    S     0x23    0x2     -           Mem(R[rs1] + offset) ← R[rs2]
    beq rs1, rs2, offset   SB    0x63    0x0     -           if(R[rs1] == R[rs2]) PC ← PC + {offset, 1b'0}
    bne rs1, rs2, offset   SB    0x63    0x1     -           if(R[rs1] != R[rs2]) PC ← PC + {offset, 1b'0}
    blt rs1, rs2, offset   SB    0x63    0x4     -           if(R[rs1] < R[rs2]) PC ← PC + {offset, 1b'0}
    bge rs1, rs2, offset   SB    0x63    0x5     -           if(R[rs1] >= R[rs2]) PC ← PC + {offset, 1b'0}
    bltu rs1, rs2, offset  SB    0x63    0x6     -           if(U(R[rs1]) < U(R[rs2])) PC ← PC + {offset, 1b'0}
    bgeu rs1, rs2, offset  SB    0x63    0x7     -           if(U(R[rs1]) >= U(R[rs2])) PC ← PC + {offset, 1b'0}
    auipc rd, offset       U     0x17    -       -           R[rd] ← PC + {offset, 12b'0}
    lui rd, offset         U     0x37    -       -           R[rd] ← {offset, 12b'0}
    jal rd, imm            UJ    0x6f    -       -           R[rd] ← PC + 4; PC ← PC + {imm, 1b'0}

NOTE: Since our assembler is a one-pass assembler and the input contains no labels, the offset in SB-type and U-type instructions and the imm in UJ-type instructions will always be integers.

The pseudo-instructions are:

    Pseudo-instruction            Format              Uses
    Branch on Equal to Zero       beqz rs1, label     beq
    Branch on not Equal to Zero   bnez rs1, label     bne
    Jump                          j label             jal
    Jump Register                 jr rs1              jalr
    Load Immediate                li rd, immediate    lui, addi
    Move                          mv rd, rs1          addi
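
The five single-instruction pseudo-instructions above can be rewritten into base instructions before encoding. The following is a minimal sketch under assumptions: the helper name expand_pseudo is hypothetical, the expansions are the conventional RISC-V ones (compare against Venus output before relying on them), and li is left out because it may expand to two instructions.

```c
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper: writes the base-instruction text for a pseudo-op
 * into out (space-delimited, as in the input format of this project).
 * a and b are the first and second operands; b is unused for j and jr.
 * Returns 1 if op was one of the five handled pseudo-ops, 0 otherwise.
 */
int expand_pseudo(const char *op, const char *a, const char *b,
                  char *out, size_t n)
{
    if (strcmp(op, "beqz") == 0) {
        snprintf(out, n, "beq %s x0 %s", a, b);   /* compare against zero */
    } else if (strcmp(op, "bnez") == 0) {
        snprintf(out, n, "bne %s x0 %s", a, b);
    } else if (strcmp(op, "j") == 0) {
        snprintf(out, n, "jal x0 %s", a);         /* discard return address */
    } else if (strcmp(op, "jr") == 0) {
        snprintf(out, n, "jalr x0 %s 0", a);
    } else if (strcmp(op, "mv") == 0) {
        snprintf(out, n, "addi %s %s 0", a, b);   /* copy via add-zero */
    } else {
        return 0;
    }
    return 1;
}
```

Rewriting to text and then reusing the normal instruction path keeps the encoder free of special cases; an implementation could equally emit the machine code directly.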

For further reference, here are the bit lengths of the instruction components.

    R-TYPE    funct7  rs2  rs1  funct3  rd  opcode
    Bits      7       5    5    3       5   7

    I-TYPE    imm[11:0]  rs1  funct3  rd  opcode
    Bits      12         5    3       5   7

    S-TYPE    imm[11:5]  rs2  rs1  funct3  imm[4:0]  opcode
    Bits      7          5    5    3       5         7

    SB-TYPE   imm[12]  imm[10:5]  rs2  rs1  funct3  imm[4:1]  imm[11]  opcode
    Bits      1        6          5    5    3       4         1        7

    U-TYPE    imm[31:12]  rd  opcode
    Bits      20          5   7

    UJ-TYPE   imm[20]  imm[10:1]  imm[11]  imm[19:12]  rd  opcode
    Bits      1        10         1        8           5   7
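
The field layouts above translate directly into shifts and masks. Below is a minimal sketch of one encoder per format; the function names are hypothetical and not part of the framework. It assumes that the SB and UJ immediates are byte offsets whose bit 0 is dropped, and that the U-type operand is the 20-bit value placed in imm[31:12]. Check both assumptions against Venus.

```c
#include <stdint.h>

uint32_t encode_r(uint32_t funct7, uint32_t rs2, uint32_t rs1,
                  uint32_t funct3, uint32_t rd, uint32_t opcode)
{
    return (funct7 << 25) | (rs2 << 20) | (rs1 << 15) |
           (funct3 << 12) | (rd << 7) | opcode;
}

uint32_t encode_i(int32_t imm, uint32_t rs1, uint32_t funct3,
                  uint32_t rd, uint32_t opcode)
{
    uint32_t u = (uint32_t)imm & 0xFFFu;          /* imm[11:0] */
    return (u << 20) | (rs1 << 15) | (funct3 << 12) | (rd << 7) | opcode;
}

uint32_t encode_s(int32_t imm, uint32_t rs2, uint32_t rs1,
                  uint32_t funct3, uint32_t opcode)
{
    uint32_t u = (uint32_t)imm;
    return (((u >> 5) & 0x7Fu) << 25) |           /* imm[11:5] */
           (rs2 << 20) | (rs1 << 15) | (funct3 << 12) |
           ((u & 0x1Fu) << 7) | opcode;           /* imm[4:0]  */
}

uint32_t encode_sb(int32_t imm, uint32_t rs2, uint32_t rs1,
                   uint32_t funct3, uint32_t opcode)
{
    uint32_t u = (uint32_t)imm;                   /* byte offset (assumed) */
    return (((u >> 12) & 0x1u) << 31) |           /* imm[12]   */
           (((u >> 5) & 0x3Fu) << 25) |           /* imm[10:5] */
           (rs2 << 20) | (rs1 << 15) | (funct3 << 12) |
           (((u >> 1) & 0xFu) << 8) |             /* imm[4:1]  */
           (((u >> 11) & 0x1u) << 7) | opcode;    /* imm[11]   */
}

uint32_t encode_u(uint32_t imm20, uint32_t rd, uint32_t opcode)
{
    return ((imm20 & 0xFFFFFu) << 12) | (rd << 7) | opcode;
}

uint32_t encode_uj(int32_t imm, uint32_t rd, uint32_t opcode)
{
    uint32_t u = (uint32_t)imm;                   /* byte offset (assumed) */
    return (((u >> 20) & 0x1u) << 31) |           /* imm[20]    */
           (((u >> 1) & 0x3FFu) << 21) |          /* imm[10:1]  */
           (((u >> 11) & 0x1u) << 20) |           /* imm[11]    */
           (((u >> 12) & 0xFFu) << 12) |          /* imm[19:12] */
           (rd << 7) | opcode;
}
```

For example, under these assumptions encode_r(0x00, 3, 2, 0x0, 1, 0x33) produces 0x003100B3, the encoding of add x1 x2 x3.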

Getting Started

File Structure and Usage

The directory tree of the framework should look like the following:


    .
    ├── inc
    │   ├── assembler.h
    │   └── util.h
    ├── Makefile
    ├── main.c
    ├── src
    │   ├── assembler.c
    │   └── util.c
    └── test
        ├── test.ref
        └── test.S
  

main.c is the entry point of the whole assembler. You should not modify this file.

assembler.c and assembler.h are where you implement the assembler function.

util.c and util.h contain some helper functions. You can also add useful functions there.

The test directory contains a basic test and the corresponding result.

Build & Execute

  1. Run make to compile the code. The assembler executable will be main.
  2. Alternatively, build the code with CMake: create a directory build, then run cmake .. && make inside it. The executable will be build/main.
  3. To run the assembler, type main input_file output_file. input_file contains RISC-V instructions (see below for a detailed description). output_file is where your results are written.
  4. Run make test to test your code with test/test.S. Your output file will be test/test.out.

Input & Output

Input

Input will be a file containing RISC-V instructions. You can assume that there are no empty lines or comments and that each line ends with a \n. Operands are delimited by spaces instead of commas, e.g. add x1 x2 x3.
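
Given this input format, each line can be split on spaces and the trailing newline. Here is a minimal sketch using strtok; the helper name split_line and the caller-provided token array are assumptions for illustration.

```c
#include <string.h>

/*
 * Hypothetical helper: splits line in place on spaces and the trailing
 * newline. tokens[i] point into line, so line must stay alive and must
 * be writable. Returns the number of tokens stored (at most max_tokens).
 */
int split_line(char *line, char *tokens[], int max_tokens)
{
    int count = 0;
    char *tok = strtok(line, " \n");

    while (tok != NULL && count < max_tokens) {
        tokens[count++] = tok;
        tok = strtok(NULL, " \n");
    }
    return count;
}
```

Because strtok overwrites delimiters with '\0', pass it a buffer you own (for example, one filled by fgets), never a string literal.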

Output

Output should be RISC-V machine code. Use the function dump_code in src/util.c to output machine code. This function requires a file handle and a uint32_t variable as parameters: the output file and the code to be dumped. Do not use your own output function; otherwise, there may be format problems. Also, do not change the output format in dump_code, since we will use your util.c when grading.

Error Handling

If the input file contains illegal instructions, you should detect them and write error information to the output file. Use the function dump_error_information in src/util.c to output error information. After an error occurs, you should continue assembling the remaining instructions and keep outputting results and errors. Also, do not terminate the program directly with exit. Quitting unexpectedly will be treated as a runtime error.

To simplify error handling, we promise that there will be exactly one space between adjacent strings. You also do not need to handle instructions with too many or too few operands, like addi a0 a1 or addi a0 a0 a0 1. Load/store instructions will always be in the correct format, e.g. lw a0 0(a1), but the correctness of the registers and offset is not guaranteed.
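
Since the offset(rs1) shape is guaranteed but its contents are not, one approach is to split the operand first and validate the two pieces separately. This is a minimal sketch; the helper name split_mem_operand is hypothetical.

```c
#include <string.h>

/*
 * Hypothetical helper: splits "offset(reg)" in place.
 * On success stores pointers to the offset text and the register text
 * and returns 1; returns 0 if the parentheses are missing or misplaced.
 */
int split_mem_operand(char *operand, char **offset, char **reg)
{
    char *open = strchr(operand, '(');
    char *close;

    if (open == NULL) {
        return 0;
    }
    close = strchr(open, ')');
    if (close == NULL || close[1] != '\0') {
        return 0;
    }
    *open = '\0';        /* terminate the offset text   */
    *close = '\0';       /* terminate the register text */
    *offset = operand;
    *reg = open + 1;
    return 1;
}
```

The two resulting strings can then go through the same register and immediate checks as any other operand.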

Here are situations you need to consider in this project:

  1. Non-existent instruction: All supported instructions are listed above; any other instruction should be treated as illegal.
  2. Bad registers: Wrong register names or register numbers that are out of range should be detected, e.g. rp, x32. You don't need to handle cases like x01 and a-1.
  3. Bad immediate or offset: The imm or offset in an instruction may not be a number, e.g. addi a0 a1 a0.
  4. Immediate out of range: The immediate in some instructions is limited to a certain range, since the number of bits used to represent imm is limited. For example, imm in addi should be between -2048 and 2047. Refer to Venus and the RISC-V manual for more information about these limits.
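
Cases 3 and 4 can be checked in one place with strtol. The sketch below is only an illustration: the names parse_imm and imm_status are hypothetical, and it assumes immediates are written in decimal (this handout does not say whether other bases must be accepted).

```c
#include <errno.h>
#include <stdlib.h>

enum imm_status { IMM_OK, IMM_NOT_A_NUMBER, IMM_OUT_OF_RANGE };

/*
 * Hypothetical helper: parses text as a decimal integer and checks that
 * it lies in [lo, hi]. On IMM_OK the parsed value is stored in *value.
 */
enum imm_status parse_imm(const char *text, long lo, long hi, long *value)
{
    char *end;
    long v;

    if (text == NULL || *text == '\0') {
        return IMM_NOT_A_NUMBER;
    }
    errno = 0;
    v = strtol(text, &end, 10);
    /* No digits consumed, or trailing junk such as "12x" or "a0". */
    if (end == text || *end != '\0') {
        return IMM_NOT_A_NUMBER;
    }
    /* ERANGE means the text was numeric but too large even for long. */
    if (errno == ERANGE || v < lo || v > hi) {
        return IMM_OUT_OF_RANGE;
    }
    *value = v;
    return IMM_OK;
}
```

For addi, a call such as parse_imm(token, -2048, 2047, &imm) covers both the "not a number" and the "out of range" cases; other instructions only change the bounds.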

Testing

Diff

Use diff file1 file2 to compare your output with the reference answer. Note that we will use diff to check your answer. To see how to interpret diff results, click here

Valgrind

To check for memory leaks, you can use Valgrind by running valgrind --tool=memcheck --leak-check=full --track-origins=yes main input_file output_file

Venus

Venus is a powerful assembler and you can use Venus to test the correctness of your code.

First type RISC-V instructions on the editor page. Then, on the simulator page, you can see the machine code of each instruction. You can also use the Dump button to collect all the machine code as a reference.

Tips

  1. The immediate in auipc and lui should be between 0 and 1048575. Venus treats this immediate as an unsigned integer by default, while the official manual does not mention this. We choose to follow Venus. For auipc, since the starting address of text is smaller than that of data, PC-relative addresses are always larger than the current PC, so the offset is non-negative. For lui, the immediate is loaded into the upper part of the register, so its sign does not matter.
  2. The immediate in jal should be between -1048576 and 1048575. Venus does not limit this immediate for some reason, even though immediates outside this range cannot be represented. However, we are going to follow the hardware limitation.
  3. This project needs a lot of string splitting. You may find strtok useful.
  4. You need to check whether the immediate in an li instruction is between -2048 and 2047. If so, li should be translated into a single addi instruction. Otherwise, it will be translated into lui and addi.
  5. Try to generate your own test cases. Code needs testing.
  6. Don't forget to write comments frequently.
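
For the two-instruction form of li in tip 4, the 32-bit immediate has to be split so that lui followed by addi reproduces it. Because addi sign-extends its 12-bit immediate, the upper part must be rounded up whenever the lower 12 bits have their top bit set. The following is a minimal sketch of that arithmetic with hypothetical helper names; verify the exact instruction sequence against Venus.

```c
#include <stdint.h>

/* Returns 1 if imm fits in a single addi (12-bit signed immediate). */
int li_fits_addi(int32_t imm)
{
    return imm >= -2048 && imm <= 2047;
}

/*
 * Hypothetical helper: splits imm so that
 *   (upper << 12) + lower == imm   (modulo 2^32),
 * with upper in 0..1048575 (for lui) and lower in -2048..2047 (for addi).
 */
void split_li(int32_t imm, uint32_t *upper, int32_t *lower)
{
    uint32_t u = (uint32_t)imm;
    int32_t lo = (int32_t)(u & 0xFFFu);

    if (lo >= 0x800) {
        lo -= 0x1000;            /* sign-extend the low 12 bits */
    }
    *lower = lo;
    /* Subtracting lo compensates for the sign extension done by addi. */
    *upper = ((u - (uint32_t)lo) >> 12) & 0xFFFFFu;
}
```

For example, 0x12345FFF splits into upper = 0x12346 and lower = -1, since 0x12346000 - 1 = 0x12345FFF.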

Submission

You should submit your code via GitHub. Please follow the guidance on Gradescope to submit your code on GitHub. Note that we will not use your main.c or Makefile for grading. The compilation flags will be -Wpedantic -Wall -Wextra -Wvla -Werror -std=c11.